Abstract

Multiple kernel learning (MKL), structured sparsity, and multi-task learning have recently received
considerable attention. In this paper, we show how different MKL algorithms can be understood
as applications of either regularization on the kernel weights or block-norm-based regularization,
which is more common in structured sparsity and multi-task learning. We show that these two
regularization strategies can be systematically mapped to each other through a concave conjugate
operation. When the kernel-weight-based regularizer is separable into components, we can naturally
consider a generative probabilistic model behind MKL. Based on this model, we propose
learning algorithms for the kernel weights through the maximization of marginal likelihood. We
show through numerical experiments that $\ell_2$-norm MKL and Elastic-net MKL achieve comparable
accuracy to uniform kernel combination. Although uniform kernel combination might be preferable
for its simplicity, $\ell_2$-norm MKL and Elastic-net MKL can learn the usefulness of the information
sources represented as kernels. In particular, Elastic-net MKL achieves sparsity in the kernel
weights.

1. Introduction



In many learning problems, the choice of feature representation, descriptors, or kernels plays a 
crucial role. The optimal representation is problem specific. For example, we can represent a web 
page as a bag-of-words, which might help us in classifying whether the page is discussing politics 
or economy; we can also represent the same page by the links provided in the page, which could be 
more useful in classifying whether the page is supporting political party A or B. Similarly, in a visual
categorization task, a color-based descriptor might be useful in distinguishing an apple from a lemon
but not in discriminating an airplane from a car. Given that there is no single feature representation
that works in every learning problem, it is crucial to combine them in a problem-dependent manner
for a successful data analysis.

In this paper, we consider the problem of combining multiple data sources in a kernel-based
learning framework. More specifically, we assume that a data point $x \in \mathcal{X}$ lies in a space $\mathcal{X}$ and we
are given $M$ candidate kernel functions $k_m : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ ($m = 1, \ldots, M$). Each kernel function
corresponds to one data source. A conical combination of $k_m$ ($m = 1, \ldots, M$) gives the combined
kernel function $\bar{k} = \sum_{m=1}^{M} d_m k_m$, where $d_m$ is a nonnegative weight. Our goal is to find a good set
of kernel weights based on some training examples.
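
As a concrete illustration of the conical combination above, the following minimal sketch forms a combined Gram matrix from per-source Gram matrices. The function name `combine_kernels` and the toy data are hypothetical and chosen only for illustration; NumPy is assumed to be available.

```python
import numpy as np

def combine_kernels(gram_matrices, weights):
    """Return sum_m d_m K_m for nonnegative kernel weights d_m."""
    weights = np.asarray(weights, dtype=float)
    if len(gram_matrices) != len(weights):
        raise ValueError("need exactly one weight per Gram matrix")
    if np.any(weights < 0):
        raise ValueError("kernel weights must be nonnegative")
    return sum(d * K for d, K in zip(weights, gram_matrices))

# Toy data (an assumption): two linear kernels on disjoint feature groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
K1 = X[:, :2] @ X[:, :2].T
K2 = X[:, 2:] @ X[:, 2:].T
K_bar = combine_kernels([K1, K2], [0.3, 0.7])
```

Because every weight is nonnegative, the combined matrix stays symmetric positive semidefinite, so it can be passed to any kernel method in place of a single kernel.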

Various approaches have been proposed for the above problem under the name multiple kernel 
learning (MKL) (Lanckriet et al., 2004; Bach et al., 2004; Zien and Ong, 2007; Varma and Ray, 
2007; Aflalo et al., 2009; Gehler and Nowozin, 2009; Kloft et al., 2009; Longworth and Gales, 2009; 
Cortes et al., 2009). Recently, Kloft et al. (2009, 2010) have shown that many MKL approaches
can be understood as applications of penalty-based regularization (Tikhonov regularization) or
constraint-based regularization (Ivanov regularization) on the kernel weights $d_m$. Meanwhile, there
is a growing interest in learning under structured sparsity assumptions (Yuan and Lin, 2006; Argyriou
et al., 2008; Huang et al., 2009; Jacob et al., 2009; Jenatton et al., 2009), which employs another
regularization based on the so-called block-norm.

Therefore, a natural question to ask is how these two regularization strategies are related to 
each other. For simple cases the correspondence is well known (see Bach et al. (2004, 2005); 
Kloft et al. (2009)). Moreover, in the context of structured sparsity, Micchelli et al. (2010) have 
proposed a more sophisticated class of penalty functions that employ both Tikhonov and Ivanov 
regularizations on the kernel weights and have shown the corresponding block-norm regularization. 
The first contribution of this paper is to show that under some mild assumptions the kernel-weight- 
based regularization and the block-norm-based regularization can be mapped to each other in a 
systematic manner through a concave conjugate operation. 

All the regularization strategies we have discussed so far are formulated as convex optimization
problems. The second contribution of this paper is to propose a nonconvex regularizer based on the
marginal likelihood. Although the overall minimization problem is nonconvex, we propose an iterative
algorithm that alternately minimizes two convex objectives. Although Bayesian approaches have
been applied to MKL earlier in a transductive nonparametric setting by Zhang et al. (2004), and in a
setting similar to the relevance vector machine (Tipping, 2001) by Girolami and Rogers (2005) and
Damoulas and Girolami (2008), our formulation is more coherent with the correspondence between
Gaussian process classification/regression and kernel methods (Rasmussen and Williams, 2006).
Note that very recently Archambeau and Bach (2010) and Urtasun (2010) also studied similar models.

This paper is structured as follows. In Section 2, we start by analyzing learning with a fixed
kernel combination. Then we discuss the two regularization strategies (kernel-weight-based
regularization and block-norm-based regularization). Finally, we present our main result on the
correspondence between the two formulations. In Section 3, we start by viewing the separable
kernel-weight-based model as a hierarchical maximum a posteriori (MAP) estimation problem, and
propose an empirical Bayesian approach for the same model. Furthermore, we show a connection
to the general framework we discuss in Section 2. In Section 4, we numerically compare the proposed
empirical Bayesian MKL and various MKL models on visual categorization tasks from the Caltech 101
dataset (Fei-Fei et al., 2004) using 1,760 kernel functions. Finally, we summarize our contributions
in Section 5.

2. Multiple kernel learning frameworks and their connections 

In this section, we first consider the problem of learning a classifier with a fixed kernel combination.
Then we extend this framework to jointly optimize the kernel weights together with the classifier.
Next, we consider the block-norm-based regularization, which has been discussed in the structured
sparsity literature (including group lasso) (Yuan and Lin, 2006; Argyriou et al., 2008; Huang et al.,
2009; Jacob et al., 2009; Jenatton et al., 2009). Our main concern is how these two formulations
are related to each other. We show two theorems (Theorems 1 and 7) that map the two formulations
to each other. Using these theorems, we show that previously proposed regularized MKL models can be
systematically transformed from one formulation to another.

2.1 Learning with fixed kernel combination

We assume that we are given $N$ training examples $(x_i, y_i)_{i=1}^{N}$ where $x_i$ belongs to an input space $\mathcal{X}$
and $y_i$ belongs to an output space $\mathcal{Y}$ (usual settings are $\mathcal{Y} = \{\pm 1\}$ for classification and $\mathcal{Y} = \mathbb{R}$ for
regression).

We first consider a learning problem with fixed kernel weights. More specifically, we fix nonnegative
kernel weights $d_1, d_2, \ldots, d_M$ and consider the RKHS $\bar{\mathcal{H}}$ corresponding to the combined
kernel function $\bar{k} = \sum_{m=1}^{M} d_m k_m$. The squared RKHS norm of a function $\bar{f}$ in the combined RKHS
$\bar{\mathcal{H}}$ can be represented as follows:

$$
\|\bar{f}\|_{\bar{\mathcal{H}}}^2 = \min_{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M} \sum_{m=1}^{M} \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} \quad \text{s.t.} \quad \bar{f} = \sum_{m=1}^{M} f_m, \tag{1}
$$

where $\mathcal{H}_m$ is the RKHS that corresponds to the kernel function $k_m$. If $d_m = 0$, the ratio
$\|f_m\|_{\mathcal{H}_m}^2 / d_m$ is defined to be zero if $\|f_m\|_{\mathcal{H}_m} = 0$ and infinity otherwise. See Sec 6 in Aronszajn
(1950), and also Lemma 25 in Micchelli and Pontil (2005) for the proof. We also provide
some intuition for a finite dimensional case in Appendix A.
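
The representation in Equation (1) can be checked numerically in a finite-dimensional analogue. The sketch below is an illustration under stated assumptions, not the construction of Appendix A: functions are replaced by vectors of values at $N$ points, the norm of each component by $f_m^\top K_m^{-1} f_m$ with a positive definite Gram matrix $K_m$, and the candidate minimizer $f_m = d_m K_m \bar{K}^{-1} \bar{f}$ is derived from the stationarity condition. All data are random and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 3

# Hypothetical positive definite Gram matrices K_m and kernel weights d_m.
Ks = []
for _ in range(M):
    A = rng.normal(size=(N, N))
    Ks.append(A @ A.T + 0.1 * np.eye(N))
d = np.array([0.5, 1.0, 2.0])
f_bar = rng.normal(size=N)

K_bar = sum(dm * Km for dm, Km in zip(d, Ks))
alpha = np.linalg.solve(K_bar, f_bar)
lhs = f_bar @ alpha  # squared norm of f_bar under the combined kernel

def decomposition_cost(parts):
    """sum_m f_m^T K_m^{-1} f_m / d_m for a decomposition f_bar = sum_m f_m."""
    return sum(p @ np.linalg.solve(Km, p) / dm for p, Km, dm in zip(parts, Ks, d))

# Candidate minimizer of the right-hand side of Equation (1).
f_opt = [dm * (Km @ alpha) for dm, Km in zip(d, Ks)]
rhs = decomposition_cost(f_opt)

# Any other feasible decomposition should cost strictly more.
delta = rng.normal(size=N)
f_pert = [f_opt[0] + delta, f_opt[1] - delta, f_opt[2]]
rhs_pert = decomposition_cost(f_pert)
```

In this setting `lhs` and `rhs` agree up to rounding, while the perturbed decomposition (which still sums to `f_bar`) has a larger cost.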

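Using the above representation, a supervised learning problem with a fixed kernel combination
can be written as follows:

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + \frac{C}{2} \sum_{m=1}^{M} \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m}, \tag{2}
$$

where $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a loss function and we assume that $\ell$ is convex in the second argument;
for example, the loss function can be the hinge loss $\ell_H(y_i, z_i) = \max(0, 1 - y_i z_i)$, or the quadratic
loss $\ell_Q(y_i, z_i) = (y_i - z_i)^2 / (2\sigma_y^2)$.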

It might seem that we are making the problem unnecessarily complex by introducing $M$ functions
$f_m$ to optimize instead of simply optimizing over $\bar{f}$. However, explicitly handling the kernel
weights enables us to consider various regularization strategies on the weights as we see in the next
subsection.



2.2 Kernel-weight-based regularization 

Now we are ready to also optimize the kernel weights $d_m$ in the above formulation. Clearly there is
a need for regularization, because the objective (2) is a monotone decreasing function of the kernel
weights $d_m$. Intuitively speaking, $d_m$ corresponds to the complexity allowed for the $m$th regression
function $f_m$; the more complexity we allow, the better the fit to the training examples becomes.
Thus without any constraint on $d_m$, we can get a severe overfitting problem.

Let $h : \mathbb{R}_+^M \to \mathbb{R} \cup \{+\infty\}$ be a function from a non-negative real vector $d \in \mathbb{R}_+^M$ to a real
number. One way to penalize the complexity is to minimize the objective (2) together with the
regularizer $h(d)$ as follows:

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R},\; d \in \mathbb{R}_+^M}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + \frac{C}{2} \Big( \sum_{m=1}^{M} \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + h(d) \Big). \tag{3}
$$

Note that if $h$ is a convex function, the above optimization problem is jointly convex in $f_m$ and $d_m$
(see Boyd and Vandenberghe (2004, Example 3.18)).

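Example 1 ($\ell_p$-norm MKL via Tikhonov regularization) Let $h(d) = \sum_{m=1}^{M} d_m^p / p$. Then we
have

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R},\; d \in \mathbb{R}_+^M}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + \frac{C}{2} \sum_{m=1}^{M} \Big( \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + \frac{d_m^p}{p} \Big).
$$

The special case $p = 1$ was considered earlier in Varma and Ray (2007).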

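Example 2 ($\ell_p$-norm MKL via Ivanov regularization) Let $h(d) = 0$ if $\sum_{m=1}^{M} d_m^p \le 1$ and
$h(d) = +\infty$ otherwise. Then we have

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R},\; d \in \mathbb{R}_+^M}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + \frac{C}{2} \sum_{m=1}^{M} \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} \quad \text{s.t.} \quad \sum_{m=1}^{M} d_m^p \le 1.
$$

This formulation was considered by Kloft et al. (2008, 2009). The special case $p = 2$ was considered
by Cortes et al. (2009).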

Example 3 (Multi-task learning) In this example, the linear combination inside the loss term is
defined in a sample-dependent manner. More precisely, we assume that there are $n$ tasks, and
each sample is associated with the $l(i)$th task. Let $\mathcal{H}$ be an RKHS over the input space $\mathcal{X}$. We
consider $M = n + 1$ functions $f_1, \ldots, f_M \in \mathcal{H}$ to model the task-dependent components and the
task-independent component. The first $n$ functions $f_1, \ldots, f_n \in \mathcal{H}$ represent the task-dependent
components of the classifiers, and the last function $f_M \in \mathcal{H}$ represents the component that is
common to all tasks. Accordingly, the optimization problem can be expressed as follows:

$$
\mathop{\mathrm{minimize}}_{\substack{f_1, \ldots, f_M \in \mathcal{H}, \\ b \in \mathbb{R},\; d \in \mathbb{R}_+^M}} \; \sum_{i=1}^{N} \ell\big(y_i, f_{l(i)}(x_i) + f_M(x_i) + b\big) + \frac{C}{2} \sum_{m=1}^{M} \frac{\|f_m\|_{\mathcal{H}}^2}{d_m} \quad \text{s.t.} \quad \sum_{m=1}^{M} d_m \le 1.
$$

Evgeniou and Pontil (2004) and Evgeniou et al. (2005) proposed a related model that only has one
parameter; they used $d_m = n\lambda$ for $m = 1, \ldots, n$ and $\sum_{m=1}^{n} d_m / n^2 + d_M \le 1$ instead of the constraint
in the above formulation. However, they did not discuss joint optimization of the hyperparameter $\lambda$
and the classifier.

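Example 4 (Wedge penalty) Let $h(d) = \sum_{m=1}^{M} d_m$ if $d_m \ge d_{m+1}$ for all $m = 1, \ldots, M - 1$, and
$h(d) = +\infty$ otherwise. Then we have

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R},\; d \in \mathbb{R}_+^M}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + \frac{C}{2} \sum_{m=1}^{M} \Big( \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + d_m \Big) \quad \text{s.t.} \quad d \in W,
$$

where

$$
W = \{ d : d \in \mathbb{R}_+^M,\; d_m \ge d_{m+1},\; m = 1, \ldots, M - 1 \}.
$$

This penalty function was considered by Micchelli et al. (2010).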

2.3 Block-norm-based regularization

Historically, Bach et al. (2004, 2005) and Micchelli and Pontil (2005) pointed out that the problem of
learning kernel weights and classifier simultaneously can be reduced to the problem of learning the
classifier under some special regularizer, which we call the block-norm-based regularizer in this paper.
Generalizing the presentation in the earlier papers, we define the block-norm-based regularization
as follows:

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + C g\big(\|f_1\|_{\mathcal{H}_1}^2, \ldots, \|f_M\|_{\mathcal{H}_M}^2\big), \tag{4}
$$

where $g : \mathbb{R}_+^M \to \mathbb{R}$ is called the block-norm-based regularizer.

Example 5 (Block 1-norm MKL) Let $g(x) = \sum_{m=1}^{M} \sqrt{x_m}$. Then we have

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + C \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}. \tag{5}
$$

This formulation was discussed earlier in Bach et al. (2005). When all the kernels are linear kernels
defined on non-overlapping subsets of input variables, this is equivalent to the group lasso (Yuan and
Lin, 2006).

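Example 6 (Overlapped group lasso) Suppose that we have $D$ input variables $x =
(x^{(1)}, \ldots, x^{(D)})^\top$ and kernel functions $k_m$ are defined as overlapped linear kernels as follows:

$$
k_m(x, y) = \sum_{l \in \mathcal{G}_m} x^{(l)} y^{(l)} \quad (m = 1, \ldots, M), \tag{6}
$$

where $\mathcal{G}_m$ ($m = 1, \ldots, M$) is a subset of (overlapped) indices from $\{1, \ldots, D\}$. Introducing
weight vectors $w_m \in \mathbb{R}^{|\mathcal{G}_m|}$ ($m = 1, \ldots, M$), we can rewrite $f_m(x) = w_m^\top x^{(\mathcal{G}_m)}$, where
$x^{(\mathcal{G}_m)} = (x^{(l)})_{l \in \mathcal{G}_m}$. In addition, $\|f_m\|_{\mathcal{H}_m} = \|w_m\|_2$. Then employing the same regularizer as
in Example 5, we have

$$
\mathop{\mathrm{minimize}}_{\substack{w_m \in \mathbb{R}^{|\mathcal{G}_m|}\; (m = 1, \ldots, M), \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} w_m^\top x_i^{(\mathcal{G}_m)} + b\Big) + C \sum_{m=1}^{M} \|w_m\|_2. \tag{7}
$$

This formulation was considered by Jacob et al. (2009). Except for the particular choice of the
kernel function (6), this is a special case of the block 1-norm MKL in Example 5.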

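Example 7 (Elastic-net MKL) Let $g(x) = \sum_{m=1}^{M} \big( (1 - \lambda) \sqrt{x_m} + \frac{\lambda}{2} x_m \big)$. Then we have

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + C \sum_{m=1}^{M} \Big( (1 - \lambda) \|f_m\|_{\mathcal{H}_m} + \frac{\lambda}{2} \|f_m\|_{\mathcal{H}_m}^2 \Big). \tag{8}
$$

This formulation was discussed earlier in Shawe-Taylor (2008), Longworth and Gales (2009), and
Tomioka and Suzuki (2010). Note that the elastic-net MKL (8) reduces to the block 1-norm
MKL (5) for $\lambda = 0$ and the uniform-weight combination ($d_m = 1$ in Equation (2)) for $\lambda = 1$.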
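
The two limiting cases of the elastic-net regularizer can be checked with a few lines of code. The sketch below evaluates the regularization term of Equation (8) without the constant $C$ from given block norms; the function name `elastic_net_block_regularizer` and the example norms are hypothetical.

```python
def elastic_net_block_regularizer(block_norms, lam):
    """sum_m ((1 - lam) * ||f_m|| + (lam / 2) * ||f_m||^2), as in Equation (8) without C."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return sum((1.0 - lam) * r + 0.5 * lam * r * r for r in block_norms)

norms = [3.0, 0.0, 4.0]  # hypothetical values of ||f_m||
block_1_norm = elastic_net_block_regularizer(norms, 0.0)    # sum of the norms
uniform_weight = elastic_net_block_regularizer(norms, 1.0)  # half the sum of squared norms
mixed = elastic_net_block_regularizer(norms, 0.5)
```

With `lam = 0` only the non-smooth term remains, which is what allows whole blocks to be set exactly to zero; with `lam = 1` the penalty is the smooth squared norm obtained from Equation (2) with all weights equal to one.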

2.4 Connection between the two formulations

In this subsection, we present two theorems that connect the kernel-weight-based regularization 
(Section 2.2) and the block-norm-based regularization (Section 2.3). 

The following theorem states that under some assumptions about the regularizer h, we can 
analytically eliminate the kernel weights d from the optimization problem (3). 

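Theorem 1 We assume that the kernel-weight-based regularizer $h$ is convex, zero at the origin, and
satisfies the generalized monotonicity in the following sense: for $x, y \in \mathbb{R}_+^M$ satisfying $x_m \le y_m$
($m = 1, \ldots, M$), $h$ satisfies

$$
h(x) \le h(y). \tag{9}
$$

Moreover, let $\tilde{h}(y) := -h(1/y_1, \ldots, 1/y_M)$. Then $\tilde{h}$ is a concave function and the optimization
problem (3) can be reduced to the following one:

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + C g\big(\|f_1\|_{\mathcal{H}_1}^2, \ldots, \|f_M\|_{\mathcal{H}_M}^2\big), \tag{10}
$$

where

$$
g(x) = \frac{1}{2} \inf_{y \in \mathbb{R}_+^M} \big( x^\top y - \tilde{h}(y) \big) \tag{11}
$$

is the concave conjugate function of $\tilde{h}$ (divided by two). Moreover, if $g$ is differentiable, the optimal
kernel weight $d_m$ is obtained as follows:

$$
d_m = \Big( 2 \frac{\partial g\big(\|f_1\|_{\mathcal{H}_1}^2, \ldots, \|f_M\|_{\mathcal{H}_M}^2\big)}{\partial x_m} \Big)^{-1}.
$$

Proof In order to show the concavity of $\tilde{h}$, we show the convexity of $h(1/y_1, \ldots, 1/y_M)$. This is
a straightforward generalization of the scalar composition rule in (Boyd and Vandenberghe, 2004,
p84). Let $\phi$ be any scalar convex function. Then

$$
\begin{aligned}
h\big(\phi(\theta x_1 + (1 - \theta) y_1), \ldots, \phi(\theta x_M + (1 - \theta) y_M)\big)
&\le h\big(\theta \phi(x_1) + (1 - \theta) \phi(y_1), \ldots, \theta \phi(x_M) + (1 - \theta) \phi(y_M)\big) \\
&\le \theta h\big(\phi(x_1), \ldots, \phi(x_M)\big) + (1 - \theta) h\big(\phi(y_1), \ldots, \phi(y_M)\big).
\end{aligned}
$$

In the second line, we used the convexity of $\phi$ and the monotonicity of $h$ in Equation (9); in the
third line, we used the convexity of $h$. Therefore, letting $\phi(x) = 1/x$, we have the convexity of
$h(1/y_1, \ldots, 1/y_M)$ and the concavity of $\tilde{h}$. Now
if the infimum in the minimization (11) exists in $\mathbb{R}_+^M$, we have the block-norm formulation (10)
by substituting $x_m = \|f_m\|_{\mathcal{H}_m}^2$ and $y_m = 1/d_m$ for $m = 1, \ldots, M$. The case $y_m = +\infty$
for some $m$ happens only if $x_m = \|f_m\|_{\mathcal{H}_m}^2 = 0$ because $\tilde{h}(y) \le 0$, and this corresponds
to $d_m = 0$ in the original formulation (3). The last part of the theorem follows from the fact that
$\tilde{h}(y) = \inf_{x \in \mathbb{R}_+^M} \big( x^\top y - 2 g(x) \big)$ and the optimality condition

$$
\frac{1}{d_m} - 2 \frac{\partial g\big(\|f_1\|_{\mathcal{H}_1}^2, \ldots, \|f_M\|_{\mathcal{H}_M}^2\big)}{\partial x_m} = 0.
$$
■
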
Note that the monotonicity assumption (9) and the convexity of $h$ form a strong sufficient condition
for the above correspondence to hold. What we need is the concavity of $\tilde{h}$. In the context of
variational Bayesian inference, such a condition has been intensively studied. See Section 3 and Wipf
and Nagarajan (2009); Seeger and Nickisch (2008).

In the particularly simple cases where the kernel-weight-based regularizer h is separable, the 
block-norm-based regularizer g is also separable as in the following corollary. 

Corollary 2 (Separable regularizer) Suppose that the kernel-weight-based regularizer is defined
as a separable function (with a slight abuse of notation) as follows:

$$
h_{\mathrm{sep}}(d) = \sum_{m=1}^{M} h(d_m),
$$

and $h$ is convex and nondecreasing. Then, the corresponding block-norm-based regularizer is also
separable and can be expressed as follows:

$$
g_{\mathrm{sep}}(x) = \frac{1}{2} \sum_{m=1}^{M} \inf_{y_m \ge 0} \big( x_m y_m - \tilde{h}(y_m) \big),
$$

where $\tilde{h}(y) = -h(1/y)$.

Proof The proof is straightforward and is omitted (see e.g., Boyd and Vandenberghe (2004, p95)).
■

Separable regularizers proposed earlier in the literature are summarized in Table 1.

The monotonicity assumption (9) holds for the regularizers defined in Examples 1-4. Therefore,
we have the following corollaries.

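Corollary 3 (Block $q$-norm formulation via Tikhonov regularization) Applying Theorem 1 to
the $\ell_p$-norm MKL in Example 1, we have

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + \frac{C}{q} \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^q, \tag{12}
$$

where $q = 2p / (1 + p)$.

Proof Note that the regularizer $h(d) = \sum_{m=1}^{M} d_m^p / p$ in Example 1 is separable as in Corollary 2.
Therefore, we define $\tilde{h}(y_m) = -y_m^{-p} / p$, and accordingly we have $g(x_m) = \frac{1 + p}{2p} x_m^{p/(1+p)}$, from
which we obtain Equation (12). ■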
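
The closed form in the proof above can be sanity-checked numerically for a single component. The following sketch minimizes $\frac{1}{2}(x/d + d^p/p)$ over a grid of $d$ and compares the result with $g(x) = \frac{1+p}{2p} x^{p/(1+p)}$; the minimizer $d = x^{1/(1+p)}$ agrees with the optimal kernel weight listed for the $\ell_p$-norm MKL in Table 1. The function names and the test values $x = 2$, $p = 2$ are hypothetical.

```python
def tikhonov_weight(sq_norm, p):
    """Minimizer of x/d + d**p/p over d > 0, where x = ||f_m||^2."""
    return sq_norm ** (1.0 / (1.0 + p))

def tikhonov_block_regularizer(sq_norm, p):
    """g(x) = (1 + p) / (2 p) * x**(p / (1 + p))."""
    return (1.0 + p) / (2.0 * p) * sq_norm ** (p / (1.0 + p))

def half_inner_objective(sq_norm, d, p):
    """(1/2) * (x/d + d**p/p): the part of Equation (3) that depends on d_m."""
    return 0.5 * (sq_norm / d + d ** p / p)

x, p = 2.0, 2.0
d_star = tikhonov_weight(x, p)
grid = [0.01 * k for k in range(1, 501)]
# The gap is nonnegative and small: no grid point does better than the closed form.
gap = min(half_inner_objective(x, d, p) for d in grid) - tikhonov_block_regularizer(x, p)
```

For $p = 1$ the same functions return $g(x) = \sqrt{x}$ and $d = \sqrt{x}$, that is, the block 1-norm case.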



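Corollary 4 (Block $q$-norm formulation via Ivanov regularization) Applying Theorem 1 to the
$\ell_p$-norm MKL in Example 2, we have

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + \frac{C}{2} \Big( \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^q \Big)^{2/q}, \tag{13}
$$

where $q = 2p / (1 + p)$.

Proof For the regularizer defined in Example 2, we have

$$
\tilde{h}(y) = \begin{cases} 0 & \big(\text{if } \sum_{m=1}^{M} y_m^{-p} \le 1\big), \\ -\infty & (\text{otherwise}). \end{cases}
$$

Then the block-norm-based regularizer $g$ (11) is defined as the minimum of the following constrained
minimization problem

$$
g(x) = \frac{1}{2} \min_{y \in \mathbb{R}_+^M} x^\top y \quad \text{s.t.} \quad \sum_{m=1}^{M} y_m^{-p} \le 1.
$$

We define the Lagrangian $\mathcal{L} = \frac{1}{2} x^\top y + \frac{\eta}{2p} \big( \sum_{m=1}^{M} y_m^{-p} - 1 \big)$. Taking the derivative of the Lagrangian
and setting it to zero, we have

$$
y_m = \Big( \frac{\eta}{x_m} \Big)^{1/(1+p)}.
$$

In addition, the multiplier $\eta$ is obtained as $\eta = \big( \sum_{m=1}^{M} x_m^{p/(1+p)} \big)^{(1+p)/p}$. Combining the above
two expressions, we have

$$
g(x) = \frac{1}{2} \Big( \sum_{m=1}^{M} x_m^{p/(1+p)} \Big)^{(1+p)/p},
$$

from which we obtain Equation (13). ■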
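
The constrained minimization in the proof can be verified numerically. The sketch below is an illustration that assumes at least one squared norm is positive: the kernel weights $d_m = 1/y_m$ are obtained by combining the stationarity condition of the Lagrangian with the closed form of the multiplier, which gives $d_m = x_m^{1/(1+p)} / \big(\sum_{m'} x_{m'}^{p/(1+p)}\big)^{1/p}$. The function names and input values are hypothetical.

```python
def ivanov_weights(sq_norms, p):
    """Kernel weights d_m minimizing sum_m x_m / d_m subject to sum_m d_m**p <= 1."""
    s = sum(x ** (p / (1.0 + p)) for x in sq_norms)
    return [x ** (1.0 / (1.0 + p)) / s ** (1.0 / p) for x in sq_norms]

def ivanov_block_regularizer(sq_norms, p):
    """g(x) = (1/2) * (sum_m x_m**(p/(1+p)))**((1+p)/p)."""
    s = sum(x ** (p / (1.0 + p)) for x in sq_norms)
    return 0.5 * s ** ((1.0 + p) / p)

sq_norms = [1.0, 4.0, 9.0]  # hypothetical values of ||f_m||^2
for p in (1.0, 2.0):
    d = ivanov_weights(sq_norms, p)
    constraint = sum(dm ** p for dm in d)                        # active, so equal to 1
    value = 0.5 * sum(x / dm for x, dm in zip(sq_norms, d))      # equals g(x)
    print(p, constraint, value, ivanov_block_regularizer(sq_norms, p))
```

For $p = 1$ the weights are proportional to the norms $\|f_m\|_{\mathcal{H}_m}$ and the regularizer is half the squared sum of the norms, which is the form used for multi-task learning in Equation (14).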



Except for the special structure inside the loss term, the multi-task learning problem is a special
case of the $\ell_p$-norm MKL with $p = 1$. Therefore, we have the following corollary.

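Corollary 5 (Multi-task learning) Applying Theorem 1 to the multi-task learning problem in
Example 3, we have

$$
\mathop{\mathrm{minimize}}_{\substack{f_1, \ldots, f_M \in \mathcal{H}, \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\big(y_i, f_{l(i)}(x_i) + f_M(x_i) + b\big) + \frac{C}{2} \Big( \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}} \Big)^2. \tag{14}
$$

Moreover, in the one-parameter case considered in Evgeniou and Pontil (2004) and Evgeniou et al.
(2005), we have

$$
\mathop{\mathrm{minimize}}_{\substack{f_1, \ldots, f_M \in \mathcal{H}, \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\big(y_i, f_{l(i)}(x_i) + f_M(x_i) + b\big) + \frac{C}{2} \Big( \sqrt{\frac{1}{n} \sum_{m=1}^{n} \|f_m\|_{\mathcal{H}}^2} + \|f_M\|_{\mathcal{H}} \Big)^2.
$$

Proof The first part is a special case of Corollary 4. The one-parameter case can be derived using
Jensen's inequality as follows. Let $d_m = n\lambda$ ($m = 1, \ldots, n$) and $d_M = 1 - \lambda$. Using Jensen's
inequality, we have

$$
\begin{aligned}
\sum_{m=1}^{n} \frac{\|f_m\|_{\mathcal{H}}^2}{n\lambda} + \frac{\|f_M\|_{\mathcal{H}}^2}{1 - \lambda}
&= \lambda \Bigg( \frac{\sqrt{\frac{1}{n} \sum_{m=1}^{n} \|f_m\|_{\mathcal{H}}^2}}{\lambda} \Bigg)^2 + (1 - \lambda) \Big( \frac{\|f_M\|_{\mathcal{H}}}{1 - \lambda} \Big)^2 \\
&\ge \Big( \sqrt{\frac{1}{n} \sum_{m=1}^{n} \|f_m\|_{\mathcal{H}}^2} + \|f_M\|_{\mathcal{H}} \Big)^2.
\end{aligned}
$$

The equality holds when $(1 - \lambda)/\lambda = \|f_M\|_{\mathcal{H}} \big/ \sqrt{\frac{1}{n} \sum_{m=1}^{n} \|f_m\|_{\mathcal{H}}^2}$. ■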



Note that in the one-parameter case discussed in Evgeniou and Pontil (2004) and Evgeniou et al.
(2005), the solution of the joint optimization of task similarity $d_m$ and the classifier results in either
$\sqrt{\frac{1}{n} \sum_{m=1}^{n} \|f_m\|_{\mathcal{H}}^2} = 0$ (all the tasks are the same) or $\|f_M\|_{\mathcal{H}} = 0$ (all the tasks are different).
For the general problem (14), some $\|f_m\|_{\mathcal{H}}$ may become zero but not necessarily all tasks.

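Corollary 6 (Wedge penalty) Applying Theorem 1 to the Wedge regularization in Example 4, we
have

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ b \in \mathbb{R}}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + C g\big(\|f_1\|_{\mathcal{H}_1}^2, \ldots, \|f_M\|_{\mathcal{H}_M}^2\big), \tag{15}
$$

where

$$
g(x_1, \ldots, x_M) = \sup_{\eta_1, \ldots, \eta_{M-1} \ge 0} \sum_{m=1}^{M} \sqrt{(1 + \eta_{m-1} - \eta_m) x_m}, \tag{16}
$$

with $\eta_0 = \eta_M = 0$.

Proof For the regularizer defined in Example 4, we have

$$
g(x) = \frac{1}{2} \min_{y \in \mathbb{R}_+^M} \sum_{m=1}^{M} \big( x_m y_m + y_m^{-1} \big) \quad \text{s.t.} \quad y_{m+1}^{-1} \le y_m^{-1} \quad (m = 1, \ldots, M - 1).
$$

We define the Lagrangian $\mathcal{L}$ as follows:

$$
\mathcal{L} = \frac{1}{2} \Big( \sum_{m=1}^{M} \big( x_m y_m + y_m^{-1} \big) + \sum_{m=1}^{M-1} \eta_m \big( y_{m+1}^{-1} - y_m^{-1} \big) \Big).
$$

Minimizing the Lagrangian with respect to $y$, we have

$$
y_m = \sqrt{\frac{1 + \eta_{m-1} - \eta_m}{x_m}},
$$

where we define $\eta_0 = \eta_M = 0$ for convenience. Substituting the above expression back into the
Lagrangian and forming the dual problem we obtain Equation (16). ■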



The following theorem is useful in mapping algorithms defined in the block-norm formulation 
(Section 2.3) back to the kernel-weight-based formulation (Section 2.2). 
Theorem 7 (Converse of Theorem 1) If the block-norm-based regularizer $g$ in formulation (4) is
a concave function, we can derive the corresponding kernel-weight-based regularizer $h$ as follows:

$$
h(d_1, \ldots, d_M) = -2 g^*\big( 1/(2 d_1), \ldots, 1/(2 d_M) \big),
$$

where $g^*$ denotes the concave conjugate of $g$ defined as follows:

$$
g^*(y) = \inf_{x \in \mathbb{R}_+^M} \big( x^\top y - g(x) \big).
$$

Proof The converse statement holds because

$$
\begin{aligned}
h(d_1, \ldots, d_M) &= -\tilde{h}(1/d_1, \ldots, 1/d_M) \\
&= -(2g)^*(1/d_1, \ldots, 1/d_M) \\
&= -2 g^*\big( 1/(2 d_1), \ldots, 1/(2 d_M) \big),
\end{aligned}
$$

where $g^*$ is the concave conjugate of $g$. ■

Theorem 7 can be used to derive the kernel-weight-based regularizer h corresponding to the 
elastic-net MKL in Example 7 as in the following corollary. 

Corollary 8 (Kernel-weight-based formulation for Elastic-net MKL) The block-norm-based
regularizer $g$ defined in Example 7 is a concave function, and the corresponding kernel-weight-based
regularizer can be obtained as follows:

$$
h(d_1, \ldots, d_M) = \sum_{m=1}^{M} \frac{(1 - \lambda)^2 d_m}{1 - \lambda d_m}.
$$

Proof Since the regularizer $g$ defined in Example 7 is a convex combination of two concave functions
(square-root and linear functions), it is clearly a concave function. Next, noticing that the
regularizer $g$ is separable into each component, we obtain its concave conjugate as follows:

$$
\tilde{h}(y_m) = \inf_{x_m \ge 0} \Big( x_m y_m - 2 \Big( (1 - \lambda) \sqrt{x_m} + \frac{\lambda}{2} x_m \Big) \Big). \tag{17}
$$

Taking the derivative with respect to $x_m$, we have

$$
y_m - \frac{1 - \lambda}{\sqrt{x_m}} - \lambda = 0.
$$

Substituting the above expression back into Equation (17), we have

$$
\tilde{h}(y_m) = -\frac{(1 - \lambda)^2}{y_m - \lambda}.
$$

Therefore

$$
h(d_1, \ldots, d_M) = -\sum_{m=1}^{M} \tilde{h}(1/d_m) = \sum_{m=1}^{M} \frac{(1 - \lambda)^2 d_m}{1 - \lambda d_m}.
$$
■



Note that in the special case $\lambda = 0$ (corresponding to Example 5), the kernel-weight-based
regularizer (17) reduces to the $\ell_p$-norm MKL with $p = 1$ in Example 1. See also Table 1.
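
The correspondence in Corollary 8 can be checked numerically for a single component. The sketch below evaluates the kernel-weight-based objective $\frac{1}{2}(\|f_m\|^2 / d_m + h(d_m))$ at the optimal weight listed for Elastic-net MKL in Table 1, $d_m = \|f_m\| / ((1 - \lambda) + \lambda \|f_m\|)$, and compares it with the block-norm value $(1 - \lambda)\|f_m\| + \frac{\lambda}{2}\|f_m\|^2$. The function names and the test values are hypothetical.

```python
def elastic_net_weight(norm, lam):
    """Optimal kernel weight d_m = ||f_m|| / ((1 - lam) + lam * ||f_m||)."""
    return norm / ((1.0 - lam) + lam * norm)

def elastic_net_h(d, lam):
    """h(d_m) = (1 - lam)^2 * d_m / (1 - lam * d_m), finite while lam * d_m < 1."""
    return (1.0 - lam) ** 2 * d / (1.0 - lam * d)

def kernel_weight_objective(norm, d, lam):
    """(1/2) * (||f_m||^2 / d_m + h(d_m))."""
    return 0.5 * (norm ** 2 / d + elastic_net_h(d, lam))

def block_norm_value(norm, lam):
    """(1 - lam) * ||f_m|| + (lam / 2) * ||f_m||^2."""
    return (1.0 - lam) * norm + 0.5 * lam * norm ** 2

norm, lam = 2.0, 0.5
d_star = elastic_net_weight(norm, lam)  # 4/3 for these values
at_optimum = kernel_weight_objective(norm, d_star, lam)
grid = [0.01 * k for k in range(1, 200)]  # stays inside 0 < d < 1/lam = 2
grid_min = min(kernel_weight_objective(norm, d, lam) for d in grid)
```

Both `at_optimum` and `block_norm_value(norm, lam)` equal 2 for these inputs, and no grid point attains a smaller objective. Setting `lam = 0` makes `elastic_net_h` the identity, which is the $p = 1$ case noted above.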
Table 1: Correspondence between the regularizer $h$ in the kernel-weight-based regularization (3)
and the concave function $g$ in the block-norm formulation (4). $I_{[0,1]}$ denotes the indicator
function of the interval $[0, 1]$; i.e., $I_{[0,1]}(x) = 0$ (if $x \in [0, 1]$), and $I_{[0,1]}(x) = \infty$
(otherwise).

| MKL model | $g(x_m)$ | $h(d_m)$ | Optimal kernel weight |
|---|---|---|---|
| block 1-norm MKL | $\sqrt{x_m}$ | $d_m$ | $d_m = \|f_m\|_{\mathcal{H}_m}$ |
| $\ell_p$-norm MKL | $\frac{1+p}{2p} x_m^{p/(1+p)}$ | $d_m^p / p$ | $d_m = \|f_m\|_{\mathcal{H}_m}^{2/(1+p)}$ |
| Uniform-weight MKL (block 2-norm MKL) | $x_m / 2$ | $I_{[0,1]}(d_m)$ | $d_m = 1$ |
| block $q$-norm MKL ($q > 2$) | $\frac{1}{q} x_m^{q/2}$ | $-\frac{q-2}{q} d_m^{-q/(q-2)}$ | $d_m = \|f_m\|_{\mathcal{H}_m}^{2-q}$ |
| Elastic-net MKL | $(1 - \lambda) \sqrt{x_m} + \frac{\lambda}{2} x_m$ | $\frac{(1 - \lambda)^2 d_m}{1 - \lambda d_m}$ | $d_m = \frac{\|f_m\|_{\mathcal{H}_m}}{(1 - \lambda) + \lambda \|f_m\|_{\mathcal{H}_m}}$ |



3. Empirical Bayesian multiple kernel learning 

For a separable regularizer (see Corollary 2), the kernel-weight-based formulation (3) allows a
probabilistic interpretation as a hierarchical maximum a posteriori (MAP) estimation problem as follows:

$$
\mathop{\mathrm{minimize}}_{\substack{f_1 \in \mathcal{H}_1, \ldots, f_M \in \mathcal{H}_M, \\ d \in \mathbb{R}_+^M}} \; \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i)\Big) + \frac{1}{2} \sum_{m=1}^{M} \Big( \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + h(d_m) \Big). \tag{18}
$$

The loss term can be considered as a negative log-likelihood. The first regularization term
$\|f_m\|_{\mathcal{H}_m}^2 / d_m$ can be considered as the negative log of a Gaussian process prior with variance scaled
by the hyper-parameter $d_m$. The last regularization term $h(d_m)$ corresponds to the negative log of
a hyper-prior distribution $p(d_m) \propto \exp(-h(d_m))$. In this section, instead of a MAP estimation, we
propose to maximize the marginal likelihood (evidence) to obtain the kernel weights; we call this
approach empirical Bayesian MKL.

We rewrite the (separable) kernel-weight regularization problem (18) as a probabilistic generative
model as follows:

$$
\begin{aligned}
d_m &\sim \frac{1}{Z_1} \exp(-h(d_m)) && (m = 1, \ldots, M), \\
f_m &\sim GP(f_m; 0, d_m k_m) && (m = 1, \ldots, M), \\
y_i &\sim \frac{1}{Z_2} \exp\big(-\ell(y_i, f_1(x_i) + f_2(x_i) + \cdots + f_M(x_i))\big),
\end{aligned}
$$

where $Z_1$ and $Z_2$ are normalization constants; $GP(f; 0, k)$ denotes the Gaussian process
(Rasmussen and Williams, 2006) with mean zero and covariance function $k$. We omit the bias term for
simplicity.

When the loss function is quadratic $\ell(y_i, z_i) = (y_i - z_i)^2 / (2\sigma_y^2)$, we can analytically integrate
out the Gaussian process random variables $(f_m)_{m=1}^{M}$ and compute the negative log of the marginal
likelihood as follows:

$$
-\log p(y | d) = \frac{1}{2} y^\top \bar{K}(d)^{-1} y + \frac{1}{2} \log |\bar{K}(d)|, \tag{19}
$$

where $d = (d_1, \ldots, d_M)^\top$, $K_m = (k_m(x_i, x_j))_{i,j=1}^{N}$ is the Gram matrix, and

$$
\bar{K}(d) := \sigma_y^2 I_N + \sum_{m=1}^{M} d_m K_m.
$$
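
Equation (19) can be evaluated stably through a Cholesky factorization instead of an explicit inverse and determinant. The following sketch is one possible implementation and assumes NumPy; the function name `neg_log_marginal_likelihood` is hypothetical, and, as in Equation (19), the additive constant $\frac{N}{2} \log(2\pi)$ is omitted.

```python
import numpy as np

def neg_log_marginal_likelihood(y, gram_matrices, d, sigma_y):
    """Evaluate 0.5 * y^T K_bar(d)^{-1} y + 0.5 * log|K_bar(d)| as in Equation (19)."""
    y = np.asarray(y, dtype=float)
    N = y.shape[0]
    # K_bar(d) = sigma_y^2 I_N + sum_m d_m K_m
    K_bar = sigma_y ** 2 * np.eye(N)
    for dm, Km in zip(d, gram_matrices):
        K_bar = K_bar + dm * Km
    L = np.linalg.cholesky(K_bar)               # K_bar = L L^T
    z = np.linalg.solve(L, y)                   # z^T z = y^T K_bar^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log|K_bar|
    return 0.5 * (z @ z) + 0.5 * log_det
```

When all kernel weights are zero, the value reduces to $\frac{1}{2} \|y\|^2 / \sigma_y^2 + \frac{N}{2} \log \sigma_y^2$, which provides a simple consistency check.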



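Using the quadratic upper-bounding technique in Wipf and Nagarajan (2008), we can rewrite
the marginal likelihood (19) in a more meaningful expression as follows:

$$
-\log p(y | d) = \min_{f_1, \ldots, f_M \in \mathbb{R}^N} \Big( \frac{1}{2\sigma_y^2} \Big\| y - \sum_{m=1}^{M} f_m \Big\|^2 + \frac{1}{2} \sum_{m=1}^{M} \frac{\|f_m\|_{K_m}^2}{d_m} \Big) + \frac{1}{2} \log \Big| \sigma_y^2 I_N + \sum_{m=1}^{M} d_m K_m \Big|, \tag{20}
$$

where $f_m := (f_m(x_1), \ldots, f_m(x_N))^\top$, and $\|f_m\|_{K_m}^2 = f_m^\top K_m^{-1} f_m$. See Wipf and Nagarajan
(2008, 2009) for more details. Now we can see that the minimization of the negative log marginal
likelihood (20) is a special case of kernel-weight-based regularization (3) with a concave, nonseparable
regularizer defined as

$$
h(d) = \log \Big| \sigma_y^2 I_N + \sum_{m=1}^{M} d_m K_m \Big|.
$$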

We ask whether we can derive the block-norm formulation (see Section 2.3) for the marginal
likelihood based MKL. Unfortunately Theorem 1 is not applicable because $h$ is not convex. The result in
Wipf and Nagarajan (2009, Appendix B) suggests that reparameterizing $d_m = \exp(\eta_m)$, we have
the following expression for the block-norm-based regularizer:

$$
g(x) = \frac{1}{2} \inf_{\eta \in \mathbb{R}^M} \Big( \sum_{m=1}^{M} x_m e^{-\eta_m} + \log \Big| \sigma_y^2 I_N + \sum_{m=1}^{M} e^{\eta_m} K_m \Big| \Big). \tag{21}
$$

Importantly, the above minimization is a convex minimization problem. However, this does not
change the fact that the original minimization problem (20) is a non-convex one. In fact, the term
$\|f_m\|_{K_m}^2 e^{-\eta_m}$ is convex for $\eta_m$ but not jointly convex for $f_m$ and $\eta_m$.

Instead of directly minimizing (e.g., by gradient descent) the negative log marginal likelihood (19) to
obtain a hyperparameter maximum likelihood estimate, we employ an alternative approach known as
the MacKay update (MacKay, 1992; Wipf and Nagarajan, 2009). The MacKay update alternates
between the minimization of the right-hand side of Equation (20) with respect to $f_m$ and an
(approximate) minimization of Equation (21) with respect to $\eta_m$. The minimization in Equation (20) is a
fixed kernel-weight learning problem (see Section 2.1) and for the special case of quadratic loss, it
is simply a least-squares problem. For the approximate minimization of Equation (21) with respect
to $\eta_m$, we simply take the derivative of Equation (21) and set it to zero as follows:

$$
-x_m e^{-\eta_m} + \mathrm{Tr}\Big( \Big( \sigma_y^2 I_N + \sum_{m'=1}^{M} e^{\eta_{m'}} K_{m'} \Big)^{-1} e^{\eta_m} K_m \Big) = 0,
$$

from which we have the following update equation:

$$
e^{\eta_m} \leftarrow \frac{x_m}{\mathrm{Tr}\Big( \big( \sigma_y^2 I_N + \sum_{m'=1}^{M} e^{\eta_{m'}} K_{m'} \big)^{-1} e^{\eta_m} K_m \Big)}.
$$
Therefore, substituting $d_m = e^{\eta_m}$ and $x_m = \|f_m\|_{K_m}^2$, we obtain the following iteration:

$$
(f_m)_{m=1}^{M} \leftarrow \mathop{\mathrm{argmin}}_{(f_m)_{m=1}^{M}} \Big( \frac{1}{2\sigma_y^2} \Big\| y - \sum_{m=1}^{M} f_m \Big\|^2 + \frac{1}{2} \sum_{m=1}^{M} \frac{\|f_m\|_{K_m}^2}{d_m} \Big), \tag{22}
$$

$$
d_m \leftarrow \frac{\|f_m\|_{K_m}^2}{\mathrm{Tr}\Big( \big( \sigma_y^2 I_N + \sum_{m'=1}^{M} d_{m'} K_{m'} \big)^{-1} d_m K_m \Big)} \quad (m = 1, \ldots, M). \tag{23}
$$



The convergence of this procedure is not established theoretically, but it is known to converge 
rapidly in many practical situations (Tipping, 2001). 
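
For the quadratic loss, the iteration (22)-(23) can be sketched in a few lines. The code below is a minimal illustration, not the implementation used in the experiments; it assumes NumPy, and the names `mackay_step` and `mackay_mkl`, the all-ones initialization, and the fixed iteration count are hypothetical choices. It relies on the closed-form minimizer of (22), $f_m = d_m K_m \alpha$ with $\alpha = \bar{K}(d)^{-1} y$, which follows from setting the gradient of (22) to zero, so that $\|f_m\|_{K_m}^2 = d_m^2 \alpha^\top K_m \alpha$ and no inverse of $K_m$ is needed.

```python
import numpy as np

def mackay_step(y, gram_matrices, d, sigma_y):
    """One pass of updates (22)-(23) for the quadratic loss."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    N = y.shape[0]
    K_bar = sigma_y ** 2 * np.eye(N)
    for dm, Km in zip(d, gram_matrices):
        K_bar = K_bar + dm * Km
    K_bar_inv = np.linalg.inv(K_bar)
    alpha = K_bar_inv @ y
    d_new = np.empty_like(d)
    for m, Km in enumerate(gram_matrices):
        # Step (22): the minimizer is f_m = d_m K_m alpha, hence
        # ||f_m||^2_{K_m} = f_m^T K_m^{-1} f_m = d_m^2 alpha^T K_m alpha.
        sq_norm = d[m] ** 2 * (alpha @ Km @ alpha)
        # Step (23): divide by Tr(K_bar^{-1} d_m K_m), computed with the old weights.
        trace = d[m] * np.trace(K_bar_inv @ Km)
        d_new[m] = sq_norm / trace if trace > 0 else 0.0
    return d_new

def mackay_mkl(y, gram_matrices, sigma_y, n_iter=20):
    """Run a fixed number of MacKay updates starting from uniform weights."""
    d = np.ones(len(gram_matrices))
    for _ in range(n_iter):
        d = mackay_step(y, gram_matrices, d, sigma_y)
    return d
```

A weight that reaches zero stays at zero under this update, which is how kernels are pruned. Since convergence is not established theoretically, a practical implementation might also monitor the objective (19) and stop when the weights no longer change.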



4. Numerical experiments 

In this section, we numerically compare the MKL algorithms discussed in this paper on three binary
classification problems taken from the Caltech 101 dataset (Fei-Fei et al., 2004). We
generated 1,760 kernel functions by combining four SIFT features, 22 spatial decompositions (including
the spatial pyramid kernel), two kernel functions, and 10 kernel parameters (see footnote 1). More
precisely, the kernel functions were constructed as combinations of the following four factors in the
preprocessing pipeline:

• Four types of SIFT features, namely hsvsift (adaptive scale), sift (adaptive scale), sift (scale
fixed to 4px), sift (scale fixed to 8px). We used the implementation by van de Sande et al.
(2010). The local features were sampled uniformly (grid) from each input image. We randomly
chose 200 local features and assigned visual words to every local feature using these
200 points as cluster centers.

• Local histograms obtained by partitioning the image into rectangular cells of the same size in
a hierarchical manner; i.e., level-0 partitioning has 1 cell (whole image), level-1 partitioning
has 4 cells, and level-2 partitioning has 16 cells. From each cell we computed a kernel function
by measuring the similarity of the two local feature histograms computed in the same
cell from two images. In addition, the spatial-pyramid kernel (Grauman and Darrell, 2007;
Lazebnik et al., 2006), which combines these kernels by exponentially decaying weights, was
computed. In total, we used 22 kernels (= one level-0 kernel + four level-1 kernels + 16 level-2
kernels + one spatial-pyramid kernel). See also Gehler and Nowozin (2009) for a similar
approach.

• Two kernel functions (similarity measures). We used the Gaussian kernel:

$$
k(q(x), q(x')) = \exp\Big( -\sum_{j=1}^{n} \frac{(q_j(x) - q_j(x'))^2}{2\gamma^2} \Big)
$$

for 10 band-width parameters ($\gamma$'s) linearly spaced between 0.1 and 5 and the $\chi^2$-kernel:

$$
k(q(x), q(x')) = \exp\Big( -\gamma^2 \sum_{j=1}^{n} \frac{(q_j(x) - q_j(x'))^2}{q_j(x) + q_j(x')} \Big)
$$

for 10 band-width parameters ($\gamma$'s) linearly spaced between 0.1 and 10, where $q(x), q(x') \in \mathbb{N}_+^n$
are the histograms computed in some region of two images $x$ and $x'$.

1. Preprocessed data is available from http://www.ibis.t.u-tokyo.ac.jp/ryotat/prmu09/data/.

Figure 1: Cannon vs Cup from Caltech 101 dataset.


The combination of 4 SIFT features, 22 spatial regions, 2 kernel functions, and 10 parameters resulted
in 1,760 kernel functions in total.
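
The count of 1,760 kernels can be reproduced by enumerating the four factors. The sketch below is illustrative only: the feature labels are hypothetical, and the two kernel formulas (a Gaussian kernel with $2\gamma^2$ in the denominator and a $\chi^2$ kernel scaled by $\gamma^2$) are assumed parameterizations rather than a confirmed transcription of the original definitions. NumPy is assumed.

```python
import itertools
import numpy as np

def gaussian_kernel(q1, q2, gamma):
    """Assumed form: exp(-sum_j (q1_j - q2_j)^2 / (2 gamma^2))."""
    diff = np.asarray(q1, dtype=float) - np.asarray(q2, dtype=float)
    return float(np.exp(-np.sum(diff ** 2) / (2.0 * gamma ** 2)))

def chi2_kernel(q1, q2, gamma):
    """Assumed form: exp(-gamma^2 sum_j (q1_j - q2_j)^2 / (q1_j + q2_j))."""
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    total = q1 + q2
    mask = total > 0  # bins that are empty in both histograms contribute nothing
    dist = np.sum((q1[mask] - q2[mask]) ** 2 / total[mask])
    return float(np.exp(-gamma ** 2 * dist))

# Hypothetical labels for the four SIFT variants and the 22 spatial regions.
features = ["hsvsift", "sift", "sift-4px", "sift-8px"]
regions = range(22)
kernel_grids = {
    "gaussian": np.linspace(0.1, 5.0, 10),   # 10 band-width values in [0.1, 5]
    "chi2": np.linspace(0.1, 10.0, 10),      # 10 band-width values in [0.1, 10]
}
configs = [
    (feature, region, name, float(gamma))
    for feature, region in itertools.product(features, regions)
    for name, grid in kernel_grids.items()
    for gamma in grid
]
```

Here `len(configs)` equals 4 × 22 × 2 × 10 = 1,760. Each configuration would correspond to one Gram matrix computed from the histograms of the given feature type in the given region.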

We compare uniform kernel combination, block 1-norm MKL, Elastic-net MKL with $\lambda = 0.5$, 
Elastic-net MKL with $\lambda$ chosen by cross validation on the training set, $\ell_2$-norm MKL 
(Cortes et al., 2009; Kloft et al., 2009), and empirical Bayesian MKL. $\ell_2$-norm MKL uses the 
hinge loss, and the other MKL models (except empirical Bayesian MKL) use the logit loss. We also 
include uniform kernel combination with the squared loss to make the comparison between the 
empirical Bayesian MKL and the rest possible. Since the difference between Uniform (logit) and 
Uniform (square) is small, we expect that the discussion here is not specific to the choice of loss 
functions. For the Elastic-net MKL (8), we either fix the constant $\lambda$ as $\lambda = 0.5$ 
(Elastic (0.5)) or we choose the value of $\lambda$ from $\{0, 0.2, \ldots, 0.8, 1\}$. MKL models 
with the logit loss are implemented in the SpicyMKL$^{2}$ toolbox (Suzuki and Tomioka, 2009). For 
the empirical Bayesian MKL, we use the MacKay update (22)-(23). We used the implementation of 
$\ell_p$-norm MKL in the Shogun toolbox (Sonnenburg et al., 2010). The regularization constant $C$ 
was chosen by $2 \times 4$-fold cross validation on the training set for each method. We used the 
candidates $\{0.0001, 0.001, 0.01, 0.1, 1, 10\}$ for all methods except $\ell_2$-norm MKL and 
$\{0.01, 0.1, 1, 10, 100, 1000\}$ for the $\ell_2$-norm MKL. 
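
The selection of the regularization constant can be sketched as follows. This is a minimal illustration, assuming that "2 x 4-fold" means two independent random 4-fold partitions of the training set; the helper names `repeated_kfold` and `select_C`, and the callback `evaluate` (which would train a model and return validation accuracy), are hypothetical and do not correspond to any toolbox API.

```python
import numpy as np

# Candidate grid used for all methods except l2-norm MKL (from the text).
C_GRID = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]

def repeated_kfold(n_samples, n_folds=4, n_repeats=2, seed=0):
    """Yield (train_idx, valid_idx) for n_repeats independent k-fold partitions."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        folds = np.array_split(perm, n_folds)
        for i in range(n_folds):
            valid = folds[i]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            yield train, valid

def select_C(n_samples, evaluate, grid=C_GRID):
    """Return the candidate C with the highest mean validation accuracy.

    evaluate(C, train_idx, valid_idx) is a hypothetical user-supplied callback.
    """
    splits = list(repeated_kfold(n_samples))
    mean_acc = [np.mean([evaluate(C, tr, va) for tr, va in splits]) for C in grid]
    return grid[int(np.argmax(mean_acc))]
```

With this reading, each candidate value of $C$ is scored on 8 train/validation splits, and the same procedure is run separately for each method.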

Figures 1-3 show the results of applying different MKL algorithms. We can see that overall 
$\ell_2$-norm MKL, Elastic-net MKL, and uniformly-weighted MKL perform favorably compared to other 
MKL methods. Empirical Bayesian MKL and block 1-norm MKL tend to perform worse than the 
above three methods, especially when the number of samples per class is smaller than 20. However, 

2. Available from http://www.simplex.t.u-tokyo.ac.jp/~s-taiji/software/SpicyMKL/. 
Figure 2: Cannon vs Ant from Caltech 101 dataset. 
Figure 3: Cup vs Ant from Caltech 101 dataset. 



in Figure 3, they do perform comparably when the number of samples per class is above 30. Although 
Elastic-net MKL and $\ell_2$-norm MKL perform almost the same as uniform MKL in terms of accuracy, 
the right panels show that these methods can find important kernel components automatically. More 
specifically, on the "Cannon vs Cup" dataset (Figure 1), Elastic-net MKL chose 88 Gaussian RBF 
kernel functions and 792 $\chi^2$ kernel functions. Thus it prefers $\chi^2$ kernels to Gaussian RBF kernels. 
This agrees with the common choice in computer vision literature. In addition, Elastic-net MKL 
consistently chose the band-width parameter $\gamma = 0.1$ for the Gaussian RBF kernels but it never 
chose $\gamma = 0.1$ for the $\chi^2$ kernels; instead it averaged all $\chi^2$ kernels from $\gamma = 1.2$ to $\gamma = 10$. 
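
The reported counts are consistent with the kernel grid described earlier, as the short check below suggests. It assumes (this is an inference, not something stated explicitly in the text) that for every selected band width all 4 x 22 = 88 feature/region combinations were kept; the variable names are illustrative.

```python
import numpy as np

n_sift, n_regions = 4, 22
per_bandwidth = n_sift * n_regions  # kernels sharing one (kernel type, band width) pair

# Ten chi-square band widths linearly spaced between 0.1 and 10 (step 1.1).
chi2_grid = np.linspace(0.1, 10.0, 10)
# Band widths from 1.2 to 10, i.e. everything except the smallest value 0.1.
chi2_selected = chi2_grid[chi2_grid >= 1.2 - 1e-9]

n_gauss = 1 * per_bandwidth                  # only the Gaussian band width 0.1
n_chi2 = len(chi2_selected) * per_bandwidth  # nine chi-square band widths
n_total_selected = n_gauss + n_chi2
```

Under this assumption, one Gaussian band width gives 88 kernels and nine $\chi^2$ band widths give 792 kernels, for 880 selected kernels, which is half of the 1,760 candidates.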

5. Conclusion 

We have shown that various MKL and structured sparsity models including $\ell_p$-norm MKL, 
Elastic-net MKL, multi-task learning, Wedge penalty, and overlapped group lasso can be seen as 
applications of different regularization strategies. These models have been conventionally presented in 
either kernel-weight-based regularization or block-norm-based regularization. We have shown that 
these two formulations can be systematically mapped to each other under some conditions through 
a concave conjugate transformation; see Table 1. 

Furthermore, we have proposed a marginal-likelihood-based kernel learning algorithm. We have 
shown that the proposed empirical Bayesian MKL can be considered to be employing a nonconvex 
nonseparable regularizer on the kernel weights. In addition, we have derived the expression for 
the block-norm regularizer corresponding to the proposed empirical Bayesian MKL. 

We have tested the classification performance as well as the resulting kernel weights of the various 
regularized MKL models we have discussed on visual categorization tasks from the Caltech 101 dataset 
using 1,760 kernels. We have shown that Elastic-net MKL can achieve comparable classification 
accuracy to uniform kernel combination with roughly half of the candidate kernels and provide 
information about the usefulness of the candidate kernels. $\ell_2$-norm MKL also achieves similar 
classification performance and qualitatively similar kernel weights. However, $\ell_2$-norm MKL does 
not achieve sparsity in the kernel weights, in contrast to Elastic-net MKL. Empirical Bayesian MKL 
tends to perform worse than the above two methods, probably because the kernel weights it obtains 
become extremely sparse. One way to avoid such solutions is to introduce hyper-priors for the 
kernel weights as in Urtasun (2010). 

We are currently aiming to relax the sufficient condition in Theorem 1 to guarantee mapping 
from the kernel-weight-based formulation to the block-norm-based formulation. We would also like to 
have a finer characterization of the block-norm regularizer corresponding to the empirical Bayesian 
MKL (see also Wipf and Nagarajan (2008)). A theoretical argument concerning when to use sparse 
MKL models (e.g., $\ell_1$-norm MKL or empirical Bayesian MKL) and when to use non-sparse MKL 
models ($\ell_p$-norm MKL) is also necessary. 
Appendix A. Proof of Equation (1) in a finite dimensional case 

In this section, we provide a proof of Equation (1) when $\mathcal{H}_1, \ldots, \mathcal{H}_M$ are all finite dimensional. We 
assume that the input space $\mathcal{X}$ consists of $N$ points $x_1, \ldots, x_N$, for example the training points. 
The function $f_m \in \mathcal{H}_m$ is completely specified by the function values at the $N$ points 
$\boldsymbol{f}_m = (f_m(x_1), \ldots, f_m(x_N))^\top$. The kernel function $k_m$ is also specified by the Gram matrix 
$K_m = (k_m(x_i, x_j))_{i,j=1}^{N}$. The inner product $\langle f_m, g_m \rangle_{\mathcal{H}_m}$ is written as 
$\langle f_m, g_m \rangle_{\mathcal{H}_m} = \boldsymbol{f}_m^\top K_m^{-1} \boldsymbol{g}_m$, where 
$\boldsymbol{g}_m$ is the $N$-dimensional vector representation of $g_m \in \mathcal{H}_m$, assuming that the Gram matrix $K_m$ 
is positive definite. It is easy to check the reproducibility; in fact, 
$\langle f_m, k_m(\cdot, x_i) \rangle_{\mathcal{H}_m} = \boldsymbol{f}_m^\top K_m^{-1} K_m(:, i) = f_m(x_i)$, 
where $K_m(:, i)$ is a column vector of the Gram matrix $K_m$ that corresponds to the $i$th 
sample point $x_i$. 

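The right-hand side of Equation (1) is written as follows: 

$$\min_{\boldsymbol{f}_1, \ldots, \boldsymbol{f}_M \in \mathbb{R}^N} \sum_{m=1}^{M} \frac{\boldsymbol{f}_m^\top K_m^{-1} \boldsymbol{f}_m}{d_m} \quad \text{s.t.} \quad \sum_{m=1}^{M} \boldsymbol{f}_m = \bar{\boldsymbol{f}}.$$

Forming the Lagrangian, we have 

$$\sum_{m=1}^{M} \frac{\boldsymbol{f}_m^\top K_m^{-1} \boldsymbol{f}_m}{d_m} = \sum_{m=1}^{M} \frac{\boldsymbol{f}_m^\top K_m^{-1} \boldsymbol{f}_m}{d_m} + 2\boldsymbol{\alpha}^\top \Bigl(\bar{\boldsymbol{f}} - \sum_{m=1}^{M} \boldsymbol{f}_m\Bigr) \geq -\boldsymbol{\alpha}^\top \Bigl(\sum_{m=1}^{M} d_m K_m\Bigr) \boldsymbol{\alpha} + 2\boldsymbol{\alpha}^\top \bar{\boldsymbol{f}},$$

where the equality is obtained for 

$$\boldsymbol{f}_m = d_m K_m \Bigl(\sum_{m'=1}^{M} d_{m'} K_{m'}\Bigr)^{-1} \bar{\boldsymbol{f}}.$$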
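
The minimizer above can be checked numerically. The following sketch uses random positive definite Gram matrices and random positive weights (all synthetic, chosen only for illustration) and verifies that the stated components sum to the target vector, that the objective equals the quadratic form in the inverse of the weighted sum of Gram matrices (the value obtained by maximizing the lower bound over the multiplier), and that a feasible perturbation increases the objective. Variable names such as `f_bar` and `K_sum` are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 3  # number of points and number of kernels (arbitrary small sizes)

def random_pd(n):
    # Synthetic positive definite matrix standing in for a Gram matrix K_m.
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

K = [random_pd(N) for _ in range(M)]
d = rng.uniform(0.5, 2.0, size=M)   # positive kernel weights d_m
f_bar = rng.standard_normal(N)      # target vector (sum of the components)

def objective(fs):
    # sum_m f_m^T K_m^{-1} f_m / d_m
    return sum(f @ np.linalg.solve(Km, f) / dm for f, Km, dm in zip(fs, K, d))

K_sum = sum(dm * Km for dm, Km in zip(d, K))
alpha = np.linalg.solve(K_sum, f_bar)               # maximizer of the lower bound
f_opt = [dm * (Km @ alpha) for dm, Km in zip(d, K)]  # f_m = d_m K_m K_sum^{-1} f_bar

closed_form = f_bar @ alpha  # f_bar^T (sum_m d_m K_m)^{-1} f_bar

# A feasible perturbation: move a random vector from component 2 to component 1,
# which keeps the sum of the components equal to f_bar.
delta = rng.standard_normal(N)
f_pert = [f_opt[0] + delta, f_opt[1] - delta] + f_opt[2:]
```

Here `objective(f_opt)` agrees with `closed_form` up to floating-point error, while `objective(f_pert)` is strictly larger, as expected for a strictly convex objective minimized over an affine constraint.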